========================================================
Scott Burns
Udacity – Data Analyst Nanodegree
Project 3: Data Analysis with R
========================================================
I live in the city of Oakland, California, and plan to stay here. I’m curious about crime trends in the city, as they may directly affect my life and important decisions I face in the future. As I explored dataset options for this project, I was excited to find this dataset on OpenOakland.org, which contains records of all crime reports in Oakland from 2007 through earlier this summer, including incident details such as geographic location and crime type. It looked fascinating, so I decided to dive in.
Primarily, my focus was on generating a few thought-provoking visualizations, to see if they might suggest directions for more in-depth exploration.
The data were downloaded from data.openoakland.org.
Additional background on the dataset is available on Rik Belew’s blog.
Background on the dataset’s CrimeCat classifications can be found on this explainer page.
More detailed background on the dataset is available in the Dataset Description section at the end of this file.
In preparation for analysis, I loaded in the dataset of interest, and glanced at summary information about it (output suppressed).
I also created two variables to potentially use throughout subsequent plots.
- An any_crime dummy variable: any record where Desc and CrimeCat are both non-blank is assigned a value of 1. If a record doesn’t have even this minimal information about the crime, I think it’s better to exclude it as an incident.
- A date_format column holding a proper date representation parsed from the Date string.
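The two derived variables can be sketched as follows. This is a minimal sketch on toy rows: the column names Desc, CrimeCat, and Date come from the dataset, but the date format string ("%m/%d/%Y") is an assumption about how the Date strings are written.

```r
# Toy stand-in for the real crimes data frame (hypothetical values)
crimes <- data.frame(
  Desc     = c("ROBBERY", "", "BURGLARY"),
  CrimeCat = c("ROBBERY", "", "BURGLARY-AUTO"),
  Date     = c("1/15/2007", "3/02/2010", "6/30/2014"),
  stringsAsFactors = FALSE
)

# any_crime: 1 only when both description fields are non-blank
crimes$any_crime <- ifelse(crimes$Desc != "" & crimes$CrimeCat != "", 1, 0)

# date_format: Date column parsed from the Date string (format is an assumption)
crimes$date_format <- as.Date(crimes$Date, format = "%m/%d/%Y")
```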
As a first basic attempt to visualize the data, I plotted a histogram of crime reports, using a binwidth of 30 days.
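The 30-day histogram might have looked like the sketch below. A toy stand-in replaces the real crimes data frame here so the snippet is self-contained; in the actual analysis, any_crime and date_format would already exist on the loaded data.

```r
library(ggplot2)

# Toy stand-in for the real crimes data frame
crimes <- data.frame(
  any_crime   = 1,
  date_format = as.Date("2007-01-01") + sample(0:3100, 500, replace = TRUE)
)

# Histogram of crime reports in 30-day bins
p <- ggplot(subset(crimes, any_crime == 1), aes(x = date_format)) +
  geom_histogram(binwidth = 30) +
  labs(x = "Date", y = "Reports per 30 days")
```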
From the histogram, we see indications that crime reports have fallen significantly in Oakland over the last 7 years, as report counts per 30-day period in 2007 and 2008 appear to number between 9,000 and 10,000, while counts per period have been around 4,000 in recent years.
I was also curious how crimes were distributed throughout the day over the period in question. To see the dynamics, I plotted another histogram by time of day, with hourly bins, creating an hour variable to more easily bin incidents.
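Creating the hour variable and the hourly histogram could be sketched as below; the assumption is that the Time column holds strings like "13:45:00", so the hour is everything before the first colon.

```r
library(ggplot2)

# Toy Time strings (the real column is Time; "H:MM:SS" format is assumed)
crimes <- data.frame(Time = c("0:00:00", "7:15:00", "13:45:00", "23:59:00"),
                     stringsAsFactors = FALSE)

# hour: integer hour of day extracted from the Time string
crimes$hour <- as.numeric(sub(":.*", "", crimes$Time))

# Histogram with hourly bins
p <- ggplot(crimes, aes(x = hour)) +
  geom_histogram(binwidth = 1) +
  scale_x_continuous(breaks = 0:23)
```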
The hourly distribution surprised me, first by the huge spike at hour 0 (i.e., midnight to 1am). I suspected that this might reflect data entry issues, so I also checked the composition of crimes recorded at hour zero, using the original Time variable. From this, I discovered that 98,823 of the 118,590 one_am records have the value exactly “0:00:00”. This strengthened my suspicion that the spike is driven by coding issues.
Relatively confident that most of the 98,823 records occurring at midnight were probably set in the absence of a known time for a crime, or as a result of time recording errors, I decided to run the histogram on a subset that removed the ‘exactly midnight’ crimes.
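A sketch of that check follows (one_am is the name used in the text for the hour-zero subset; the toy data stands in for the real ~118,590 hour-zero rows). Because Time is a factor, levels outside hour 0 show up with zero counts in the tabulation.

```r
library(ggplot2)

# Toy stand-in: in the real data, 98,823 of 118,590 hour-zero rows are "0:00:00"
crimes <- data.frame(Time = factor(c("0:00:00", "0:00:00", "0:30:00", "7:15:00")))
crimes$hour <- as.numeric(sub(":.*", "", as.character(crimes$Time)))

# Tabulate raw Time values for the hour-zero records
one_am <- subset(crimes, hour == 0)
summary(one_am$Time)

# Re-run the hourly histogram excluding the 'exactly midnight' records
p <- ggplot(subset(crimes, as.character(Time) != "0:00:00"), aes(x = hour)) +
  geom_histogram(binwidth = 1)
```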
## 0:00:00 0:30:00 0:01:00 0:15:00 0:05:00 0:45:00 0:10:00 0:20:00 0:50:00
## 98823 3030 2517 1215 1000 913 876 766 641
## 0:40:00 0:25:00 0:35:00 0:55:00 0:02:00 0:08:00 0:06:00 0:48:00 0:23:00
## 635 461 437 410 207 198 188 186 185
## 0:03:00 0:17:00 0:16:00 0:13:00 0:22:00 0:19:00 0:31:00 0:39:00 0:29:00
## 172 170 164 162 162 156 156 156 155
## 0:04:00 0:44:00 0:43:00 0:42:00 0:18:00 0:14:00 0:53:00 0:24:00 0:52:00
## 153 153 152 150 149 147 147 146 146
## 0:07:00 0:37:00 0:32:00 0:38:00 0:12:00 0:33:00 0:34:00 0:21:00 0:11:00
## 145 145 144 144 143 143 143 141 140
## 0:58:00 0:51:00 0:27:00 0:28:00 0:36:00 0:41:00 0:59:00 0:54:00 0:47:00
## 139 134 129 129 127 122 122 121 120
## 0:09:00 0:46:00 0:56:00 0:49:00 0:57:00 0:26:00 0:01:19 0:03:59 0:04:16
## 118 114 112 111 110 98 1 1 1
## 0:08:05 0:08:30 0:10:25 0:12:22 0:19:12 0:33:02 0:33:46 0:35:20 0:42:49
## 1 1 1 1 1 1 1 1 1
## (all remaining Time factor levels fall outside hour 0 and have zero counts)
In the modified histogram, I was also somewhat surprised that crimes don’t spike to a greater extent in the evenings, with incident rates staying fairly uniform from 10am to 4pm, then rising to a moderate peak around 7pm and declining from there to a minimum at the 5am hour.
Next, I wanted to explore the distribution of crime by latitude and longitude, with histograms for each variable.
After a first glance at raw plots for each, I realized I needed to remove a few obviously incorrect entries to arrive at the histograms below. Summarizing the remaining reasonable values for Lat and Lng also gave me more useful axis ranges for the charts.
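The filtering might have looked like the sketch below. The bounds and the column names (Lat, Lng) are assumptions; the idea is simply to drop coordinates far from Oakland (roughly 37.8 N, -122.2 W) before summarizing.

```r
# Toy stand-in with one obviously bad coordinate pair
crimes <- data.frame(Lat = c(37.77, 37.80, 0.00, 37.79),
                     Lng = c(-122.27, -122.20, 0.00, -122.25))

# Keep only plausible Oakland-area coordinates (bounds are assumptions)
geo <- subset(crimes, Lat > 37 & Lat < 39 & Lng > -123.5 & Lng < -121.5)
summary(geo$Lat)
summary(geo$Lng)
```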
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 37.35 37.77 37.79 37.79 37.81 38.34
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## -123.0 -122.3 -122.2 -122.2 -122.2 -122.1
Crime seems to skew toward the mid-Northern and Western parts of the city, based on the histogram.
For a more detailed view of geographic distribution, I decided to use ggmap to plot a basic heatmap of crime in Oakland for the period in question.
In the heatmap, we see a strong concentration of crimes in the downtown area, with a pocket slightly to the northwest along San Pablo Avenue, and then all along International Boulevard to the southeast.
For another view of Oakland crime dynamics, I created a line plot of incidents per day, using the dplyr aggregation techniques we learned in the Data Analysis in R course.
Note: I created a new dataframe of incidents per day using dplyr’s ‘verbose’ method, then used it in the line plot.
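The ‘verbose’ dplyr style means explicit intermediate objects rather than a `%>%` chain. A minimal sketch (toy data stands in for the real crimes data frame):

```r
library(dplyr)
library(ggplot2)

# Toy stand-in for the real crimes data frame
crimes <- data.frame(
  any_crime   = 1,
  date_format = as.Date("2007-01-01") + c(0, 0, 1, 1, 1, 2)
)

# dplyr 'verbose' method: group, then summarise, as separate steps
day_groups    <- group_by(crimes, date_format)
crimes_by_day <- summarise(day_groups, incidents = sum(any_crime))

# Line plot of incidents per day
p <- ggplot(crimes_by_day, aes(x = date_format, y = incidents)) +
  geom_line()
```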
In the daily line plot, the decline of crime reports over time is clearly visible.
As daily crime incident counts are fairly noisy, I wanted to take a look at potentially smoother time increments - creating a line plot of any_crime incidents each month.
To create the monthly plot, I cut our date_format data in monthly units, using the syntax we learned in lesson 5 of the Data Analysis in R course, then aggregated crime data by month. To do so, I applied dplyr methods again, but this time using the ‘concise’ syntax.
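The monthly aggregation with the ‘concise’ chained syntax might look like this sketch (cut() collapses each date to the first of its month):

```r
library(dplyr)

# Toy stand-in for the real crimes data frame
crimes <- data.frame(
  any_crime   = 1,
  date_format = as.Date(c("2007-01-03", "2007-01-20", "2007-02-11"))
)

# Concise chained syntax: cut dates into months, then aggregate
crimes_by_month <- crimes %>%
  mutate(month = as.Date(cut(date_format, breaks = "month"))) %>%
  group_by(month) %>%
  summarise(incidents = sum(any_crime))
```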
The downward trend in crime reports is again evident in plotted monthly crime report data. Another striking feature of the chart is the large vertical drop in incident count at the beginning of 2014.
It seems very strange to me that the reported number of crimes would be fairly constant throughout 2012 and 2013, fall by around 40% as the year changed, then persist at a largely constant lower level around 3,700 incidents per month for the next year. I wonder if there was a major change in the way crimes were recorded or reported starting at the beginning of 2014.
To better understand the distribution of daily crime rates, I also plotted mean, median and quartile measures of daily crime rates.
Also worth noting: I was seeing several report data points for dates in the future, which I assume were data entry errors. For all plots going forward, I have chosen to subset to incidents occurring before a cut-off date of June 2015.
Mean and median daily crime rates:
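One way to overlay these summaries is with ggplot2’s stat_summary, computing mean, median, and quartiles of the daily counts within each month. This is a sketch on simulated daily counts, not the exact plotting code used:

```r
library(ggplot2)

# Simulated daily incident counts standing in for crimes_by_day
crimes_by_day <- data.frame(
  date_format = as.Date("2014-01-01") + 0:59,
  incidents   = rpois(60, 150)
)

# Mean, median, and quartile summaries of daily counts, grouped by month
p <- ggplot(crimes_by_day,
            aes(x = as.Date(cut(date_format, breaks = "month")), y = incidents)) +
  stat_summary(fun = mean, geom = "line", colour = "blue") +
  stat_summary(fun = median, geom = "line", colour = "red") +
  stat_summary(fun = function(y) quantile(y, 0.25), geom = "line", linetype = 2) +
  stat_summary(fun = function(y) quantile(y, 0.75), geom = "line", linetype = 2)
```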
After uncovering some interesting insights about local crime trends at an aggregate level, I wanted to drill down into crime segments, using some of the multivariate visualization techniques we covered in the later lessons of Data Analysis in R.
To understand how I could best group crime report incidents by description, I surveyed all the potential values appearing in the columns CrimeCat and Desc.
I assessed unique(crimes$Desc) but suppressed the output because there are about 1,800 unique descriptions in that field. This would not be a useful column to use for grouping in plots.
On the other hand, the CrimeCat column includes about 60 unique categories, as verified by unique(crimes$CrimeCat). This number is still too large for tractable visualizations, but I decided I could group these into a smaller number of main categories (variable: mainCat) using grepl string matching.
Thus I grouped crimes into homicide, robbery/larceny, assault, rape, weapons, domestic violence, traffic and court violations, and ‘quality of life’ (a class of crimes noted in the dataset, covering relatively minor violations such as ‘curfew-loitering’, drug possession, and incidents related to public liquor consumption).
I then surveyed examples from the output with head(crimes,50) and tail(crimes,50) to ensure the new column assignment worked correctly.
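The grepl-based grouping could be sketched as below. The pattern strings and the helper function name are illustrative, not the exact ones used; matching is against the CrimeCat column, with unmatched records falling into ‘Other’.

```r
# Hypothetical helper: map a CrimeCat string to a main category
assign_mainCat <- function(cc) {
  ifelse(grepl("homicide", cc, ignore.case = TRUE), "Homicide",
  ifelse(grepl("robbery|larceny|burglary", cc, ignore.case = TRUE), "Robbery/Larceny",
  ifelse(grepl("assault", cc, ignore.case = TRUE), "Assault",
  ifelse(grepl("quality", cc, ignore.case = TRUE), "Quality of Life",
         "Other"))))
}

# Toy stand-in for the real crimes data frame
crimes <- data.frame(CrimeCat = c("HOMICIDE", "LARCENY-AUTO",
                                  "QUALITY/curfew-loitering", "VANDALISM"),
                     stringsAsFactors = FALSE)
crimes$mainCat <- assign_mainCat(crimes$CrimeCat)
```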
As in previous examples, I created a new dataframe with dplyr functions, this time grouping by both date and the new mainCat variable.
The results are fairly messy with daily measures, so I decided to create a similar line plot with mainCat groupings but aggregated to monthly crime incidents. For this I built a new crime_types_by_month dataframe, as seen below.
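Building crime_types_by_month could look like the following sketch, grouping on both month and the mainCat variable (toy data stands in for the real crimes data frame):

```r
library(dplyr)

# Toy stand-in for the real crimes data frame
crimes <- data.frame(
  any_crime   = 1,
  mainCat     = c("Assault", "Assault", "Homicide"),
  date_format = as.Date(c("2007-01-05", "2007-01-09", "2007-02-01"))
)

# Incidents per month per main category
crime_types_by_month <- crimes %>%
  filter(any_crime == 1) %>%
  mutate(month = as.Date(cut(date_format, breaks = "month"))) %>%
  group_by(month, mainCat) %>%
  summarise(incidents = n(), .groups = "drop")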
Interesting trends are visible when grouping incidents by type in a single plot, but the output is still fairly messy and dynamics for some categories are hard to discern, as the scales of total incidents in each crime category are substantially different.
Thus, I decided to re-plot crime_types_by_month in a facet wrap with ‘free_y’ scale to better view dynamics by crime type.
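The faceted re-plot is a one-liner with facet_wrap; the free_y scale lets each category use its own y-axis range. Sketch (with toy monthly data in place of the real crime_types_by_month):

```r
library(ggplot2)

# Toy stand-in for crime_types_by_month
crime_types_by_month <- data.frame(
  month     = rep(as.Date(c("2007-01-01", "2007-02-01")), 2),
  mainCat   = rep(c("Assault", "Homicide"), each = 2),
  incidents = c(400, 380, 12, 9)
)

# One panel per main category, each with its own y scale
p <- ggplot(crime_types_by_month, aes(x = month, y = incidents)) +
  geom_line() +
  facet_wrap(~ mainCat, scales = "free_y")
```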
I found this chart striking: incidents in all the main crime categories appear to have dropped significantly, while each shows a different pattern of decline. Some categories, like Assault and Robbery, plummeted around 2010 and then remained steady, while others, like Homicide, Traffic, Domestic Violence and Other, show big drops later, around 2013 and 2014.
The strange discontinuity at the 2014 year mark I noted earlier is also present in these later-declining categories, with incident counts holding steady through 2014 and 2015 after falling massively from much higher levels immediately before 2013 year-end.
Besides tracking crime dynamics by type, I also wanted to explore how crimes were distributed geographically by Police Beats. To do so, I used the Beat variable, which indicates where the crime occurred/was recorded. More detail on the ‘Beat’ variable is available in the Dataset description below and at this link. There are 135 distinct police beats included as Beat values, with some values strictly numeric (like ‘31’) and others having alphanumeric identifiers (like ‘26X’).
Links to beat maps, as provided by the Oakland Police Department, are available (here and here).
Similar to the previous exploration, I built a new dataframe, grouping on incidents per month and Beat.
First I plotted month and beat dynamics on one chart, associating each beat with a color. As seen, the result was far too crowded to be useful. I thought a stacked bar chart of incidents by beat might offer more insight (also plotted below), but no clear insight was visible in the column plots either.
Since there are too many police beats to display effectively in one chart, I decided to take the top 20 beats by total crime reports and group incidents in all other beats under ‘Other’. As the calculations below show, these top 20 beats cover about 52% of all crime reports with descriptions (our any_crime variable).
First I created a dataframe with total crime incidents regardless of date.
Then I added another column to the crimes dataframe, mainBeat, preserving the value for the top 20 beats by incidents and labeling all other beats ‘Other’.
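These steps could be sketched as below. The toy data uses a top-2 cutoff so the snippet is self-contained; the real analysis would take the top 20, and the final ratio corresponds to the ~52% coverage figure.

```r
library(dplyr)

# Toy stand-in for the real crimes data frame
crimes <- data.frame(
  any_crime = 1,
  Beat      = c("04X", "04X", "04X", "08X", "08X", "31"),
  stringsAsFactors = FALSE
)

# Total incidents per beat, regardless of date
beat_totals <- crimes %>%
  filter(any_crime == 1) %>%
  group_by(Beat) %>%
  summarise(incidents = n()) %>%
  arrange(desc(incidents))

top_beats <- head(beat_totals$Beat, 2)  # 20 in the real analysis

# mainBeat: keep the top beats, label everything else 'Other'
crimes$mainBeat <- ifelse(crimes$Beat %in% top_beats, crimes$Beat, "Other")

# Share of all any_crime reports covered by the top beats
share <- sum(beat_totals$incidents[beat_totals$Beat %in% top_beats]) /
         sum(beat_totals$incidents)
```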
Following the transformation, I rebuilt the crime_by_month_and_beat dataframe with the mainBeat variable to produce clearer output, re-plotting with a facet wrap to see the crime incident dynamics by beat more clearly.
## [1] 0.5221023
## month mainBeat incidents n
## Min. :2007-01-01 04X : 101 Min. : 41.0 Min. : 42.0
## 1st Qu.:2009-02-01 06X : 101 1st Qu.: 123.0 1st Qu.: 130.0
## Median :2011-03-01 07X : 101 Median : 164.0 Median : 175.0
## Mean :2011-03-02 08X : 101 Mean : 309.7 Mean : 326.4
## 3rd Qu.:2013-04-01 19X : 101 3rd Qu.: 223.0 3rd Qu.: 232.0
## Max. :2015-05-01 20X : 101 Max. :4623.0 Max. :4855.0
## (Other):1515
Again, in the per-Beat facet view we see uniformly down-trending crime incidence over time, with some beats experiencing more pronounced local spikes around mid-to-late 2013. For some beats, including 04X, 08X, 20X and others, we also see the strange discontinuity at the end of 2013 that appeared in other views of monthly crime over time.
For these police beats, crime stays steady or spikes toward the end of 2013, then precipitously drops right at the new year, and remains or declines from the lower level to the present day.
Mapping trends to geography, we can see where each of the top 20 police beats are located in our map below.
Courtesy of the Oakland Police Department
Note that many of the top beats by crime incidence correspond to the ‘hotter’ regions in downtown Oakland as shown in our earlier exploratory heatmap, particularly ‘08X’, ‘04X’ and ‘06X’.
As a reminder, we can see a modified heatmap below, plotted over a Google roadmap.